Learning Without Time-Based Embodiment Resets in Soft-Actor Critic
Farrahi, Homayoon, Mahmood, A. Rupam
When creating new reinforcement learning tasks, practitioners often accelerate learning by building several accessory components into the task, such as breaking the environment interaction into independent episodes and frequently resetting the environment. Although they can enable the learning of complex intelligent behaviors, such task accessories can result in unnatural task setups and hinder long-term performance in the real world. In this work, we explore the challenges of learning without episode terminations and robot embodiment resets using the Soft Actor-Critic (SAC) algorithm. To learn without terminations, we present a continuing version of the SAC algorithm and show that, with simple modifications to the reward functions of existing tasks, continuing SAC can perform as well as or better than episodic SAC while reducing the sensitivity of performance to the value of the discount rate $γ$. On a modified Gym Reacher task, we investigate possible explanations for the failure of continuing SAC when learning without embodiment resets. Our results suggest that embodiment resets aid exploration of the state space in the SAC algorithm, and that removing them can lead to poor exploration of the state space and to failed or significantly slower learning. Finally, on additional simulated tasks and a real-robot vision task, we show that increasing the entropy of the policy when performance trends worse or remains static is an effective intervention for recovering the performance lost due to not using embodiment resets.
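The final intervention described above (raising policy entropy when returns stagnate) can be sketched as a small monitoring rule. This is an illustrative sketch only, not the authors' implementation: the function name, window size, and boost amount are our assumptions. In SAC with automatic temperature tuning, raising the entropy target pushes the temperature, and hence policy entropy, up.

```python
import numpy as np

def adjust_entropy_target(returns, target_entropy, window=10, boost=0.5):
    """If the recent return trend is flat or declining, raise the entropy
    target so the temperature (and policy entropy) increases.
    Window and boost values are illustrative, not from the paper."""
    if len(returns) < 2 * window:
        return target_entropy
    recent = np.mean(returns[-window:])
    previous = np.mean(returns[-2 * window:-window])
    if recent <= previous:  # performance static or trending worse
        return target_entropy + boost
    return target_entropy
```

A caller would apply this periodically to the entropy target used by the temperature loss, leaving the rest of the SAC update untouched.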
Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models
Goswami, Dipam, Wang, Liying, Twardowski, Bartłomiej, van de Weijer, Joost
Text embedding models enable semantic search, powering NLP applications such as Retrieval Augmented Generation through efficient information retrieval (IR). However, text embedding models are commonly studied in scenarios where the training data is static, limiting their application to dynamic scenarios where new training data emerges over time. IR methods generally encode a huge corpus of documents into low-dimensional embeddings and store them in a database index. During retrieval, a semantic search over the corpus is performed and the document whose embedding is most similar to the query embedding is returned. When updating an embedding model with new training data, using the already indexed corpus is suboptimal due to non-compatibility: the model that was used to obtain the corpus embeddings has changed. While re-indexing the old corpus documents with the updated model restores compatibility, it requires much more computation and time. Thus, it is critical to study how the already indexed corpus can still be used effectively without re-indexing. In this work, we establish a continual learning benchmark with large-scale datasets and continually train dense retrieval embedding models on query-document pairs from new datasets in each task, observing forgetting on old tasks due to significant drift of embeddings. We employ embedding distillation on both query and document embeddings to maintain stability and propose a novel query drift compensation method during retrieval to project new model query embeddings into the old embedding space. This enables compatibility with previously indexed corpus embeddings extracted using the old model and thus reduces forgetting. We show that the proposed method significantly improves performance without any re-indexing.
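A minimal sketch of the projection idea, under our own assumptions (a linear least-squares map fitted on paired query embeddings from the old and new models; the paper's actual compensation method may differ):

```python
import numpy as np

def fit_drift_projection(q_new, q_old):
    """Least-squares map W minimizing ||q_new @ W - q_old||^2, so that
    new-model query embeddings can be projected into the old space.
    Illustrative stand-in for the paper's drift compensation."""
    W, *_ = np.linalg.lstsq(q_new, q_old, rcond=None)
    return W

def retrieve(query_new, W, corpus_old):
    """Project a new-model query into the old space, then cosine-score it
    against corpus embeddings that were indexed with the old model."""
    q = query_new @ W
    q = q / np.linalg.norm(q)
    c = corpus_old / np.linalg.norm(corpus_old, axis=1, keepdims=True)
    return c @ q  # similarity scores over the already indexed corpus
```

With such a map, the indexed corpus embeddings are reused as-is and only queries pay the projection cost at retrieval time.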
Reinitializing weights vs units for maintaining plasticity in neural networks
Hernandez-Garcia, J. Fernando, Dohare, Shibhansh, Luo, Jun, Sutton, Rich S.
Loss of plasticity is a phenomenon in which a neural network loses its ability to learn when trained for an extended time on non-stationary data. It is a crucial problem to overcome when designing systems that learn continually. An effective technique for preventing loss of plasticity is reinitializing parts of the network. In this paper, we compare two different reinitialization schemes: reinitializing units vs reinitializing weights. We propose a new algorithm, which we name \textit{selective weight reinitialization}, for reinitializing the least useful weights in a network. We compare our algorithm to continual backpropagation and ReDo, two previously proposed algorithms that reinitialize units in the network. Through our experiments in continual supervised learning problems, we identify two settings in which reinitializing weights is more effective at maintaining plasticity than reinitializing units: (1) when the network has a small number of units and (2) when the network includes layer normalization. Conversely, reinitializing weights and units are equally effective at maintaining plasticity when the network is of sufficient size and does not include layer normalization. We found that reinitializing weights maintains plasticity in a wider variety of settings than reinitializing units.
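The core loop of such a scheme can be sketched as follows. This is not the paper's algorithm: the utility measure is supplied externally here, and the reinit fraction and scale are hypothetical knobs.

```python
import numpy as np

def selective_weight_reinit(W, utility, frac=0.05, scale=0.1, rng=None):
    """Reinitialize the fraction `frac` of weights with the lowest utility
    scores, drawing fresh small random values for them. The utility measure
    and hyperparameters are illustrative, not the paper's exact criterion."""
    rng = rng or np.random.default_rng()
    k = max(1, int(frac * W.size))
    flat_idx = np.argsort(utility, axis=None)[:k]  # least useful weights
    W_new = W.copy()
    W_new.flat[flat_idx] = rng.normal(scale=scale, size=k)  # fresh draws
    return W_new
```

Applied periodically per layer, this reinitializes individual connections rather than whole units, which is the distinction the paper studies against continual backpropagation and ReDo.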
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models
Abbes, Istabrak, Subbaraj, Gopeshh, Riemer, Matthew, Islah, Nizar, Therien, Benjamin, Tabaru, Tsuguchika, Kingetsu, Hiroaki, Chandar, Sarath, Rish, Irina
Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retrained from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages, with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) (Riemer et al., 2019a) that imbues experience replay with the benefits of gradient alignment at negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than to invest in high rates of replaying old examples.

Large Language Models (LLMs) need regular updates to stay current with new information and domains, posing a problem for organizations looking to maintain LLMs without repeatedly performing expensive retraining from scratch.
Performing updates to a model that has already received pre-training on a new distribution is the classic problem of continual learning (Ring, 1994) or lifelong learning (Thrun, 1994). We should draw a strong distinction between this setting and other settings such as fine-tuning or instruction tuning, which are generally characterized by training on much smaller datasets for a much smaller number of gradient steps.
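The replay side of the study reduces, at its simplest, to mixing a small fraction of old-distribution examples into each batch. The sketch below is illustrative only; the 5% default loosely echoes the "small replay rates" finding and is not a value taken from the paper.

```python
import numpy as np

def mixed_batch(new_data, old_data, batch_size, replay_rate=0.05, rng=None):
    """Compose a training batch containing a small fraction of replayed
    old-task examples alongside new-task examples. The replay_rate default
    is an illustrative knob, not the paper's setting."""
    rng = rng or np.random.default_rng()
    n_old = int(round(batch_size * replay_rate))
    n_new = batch_size - n_old
    idx_new = rng.choice(len(new_data), size=n_new, replace=False)
    idx_old = rng.choice(len(old_data), size=n_old, replace=False)
    return [new_data[i] for i in idx_new] + [old_data[i] for i in idx_old]
```

Gradient alignment methods such as MER go further by also shaping the order and weighting of these examples so that updates on new data do not conflict with old-task gradients.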
What Can Grokking Teach Us About Learning Under Nonstationarity?
Lyle, Clare, Sokar, Ghada, Pascanu, Razvan, Gyorgy, Andras
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While the feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature learning is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature learning are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio between the update and parameter norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.

Non-stationarity is ubiquitous in real-world applications of AI systems: datasets may grow over time, correlations may appear and then disappear as trends evolve, and AI systems themselves may take an active role in the generation of their own training data. In this paper, we propose a framework for understanding and mitigating this degradation in generalization performance which connects three previously disparate phenomena: primacy bias, grokking, and feature-learning dynamics.
Primacy bias: a neural network initially trained on one task is then trained on a different data distribution and/or objective, and achieves worse performance than a randomly initialized network on the new task (Achille et al., 2017; Ash & Adams, 2020; Nikishin et al., 2022). Grokking: a model suddenly closes the generalization gap as a result of (possibly prolonged) further training after it has initially achieved perfect training accuracy (memorization) with poor test-time performance (Power et al., 2022). Feature learning: a network's ability to make nontrivial changes to its learned representation.
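The proposed intervention can be sketched as a check-and-rescale rule: compute the effective learning rate as the ratio of update norm to parameter norm, and if it has decayed too far, shrink the parameter norm to restore it. The shrink-to-target rule below is our simplification (for networks with normalization layers, rescaling parameters leaves the represented function largely unchanged, which is what makes such a rescale benign); the target threshold is a hypothetical knob.

```python
import numpy as np

def raise_effective_lr(params, update, target_ratio):
    """Effective learning rate ~ ||update|| / ||params||. If it has fallen
    below target_ratio, shrink the parameter norm to restore it.
    A simplification of the paper's intervention, not its exact rule."""
    eff = np.linalg.norm(update) / np.linalg.norm(params)
    if eff < target_ratio:
        params = params * (eff / target_ratio)  # smaller norm => larger eff. LR
    return params
```

Run per layer (or per parameter group) at intervals during training, this keeps updates large relative to the weights, which the paper links to renewed feature learning.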
Balancing Expressivity and Robustness: Constrained Rational Activations for Reinforcement Learning
Surdej, Rafał, Bortkiewicz, Michał, Lewandowski, Alex, Ostaszewski, Mateusz, Lyle, Clare
Trainable activation functions, whose parameters are optimized alongside network weights, offer increased expressivity compared to fixed activation functions. Specifically, trainable activation functions defined as ratios of polynomials (rational functions) have been proposed to enhance plasticity in reinforcement learning. However, their impact on training stability remains unclear. In this work, we study trainable rational activations in both reinforcement and continual learning settings. We find that while their flexibility enhances adaptability, it can also introduce instability, leading to overestimation in RL and feature collapse in longer continual learning scenarios. Our main result is demonstrating a trade-off between expressivity and plasticity in rational activations. To address this, we propose a constrained variant that structurally limits excessive output scaling while preserving adaptability. Experiments across MetaWorld and DeepMind Control Suite (DMC) environments show that our approach improves training stability and performance. In continual learning benchmarks, including MNIST with reshuffled labels and Split CIFAR-100, we reveal how different constraints affect the balance between expressivity and long-term retention. Preliminary experiments in discrete action domains (e.g., Atari) did not show similar instability, suggesting that the trade-off is particularly relevant for continuous control. Together, our findings provide actionable design principles for robust and adaptable trainable activations in dynamic, non-stationary environments.

Figure 1: Interquartile Mean (IQM) performance after 1M environment steps, aggregated across 15 MetaWorld and 15 DeepMind Control Suite (DMC) environments. For MetaWorld, we measure the score, while for DMC, returns are divided by 1000 to match the upper performance bound. We compare Original Rationals (OR), our Constrained Rationals (CR), ReLU, and ReLU with Layer Normalization (LN), all trained with resets. Our results show that CR + Resets achieves the highest overall performance, highlighting the benefits of our proposed constraints in stabilizing RL training.

Neural network expressivity is a key factor in reinforcement learning (RL), particularly in dynamic environments where agents must continuously adapt. While most RL architectures rely on static activation functions, recent work suggests that allowing activations to be trainable could enhance adaptability by increasing the flexibility of individual neurons.
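For concreteness, a rational activation is P(x)/Q(x) with learnable polynomial coefficients. The sketch below uses the common "safe" denominator Q(x) = 1 + |b1·x + … + bm·x^m| and adds a hard output bound as a stand-in for the paper's structural constraint; the bound and coefficient forms are our assumptions, not the authors' exact parameterization.

```python
import numpy as np

def rational_activation(x, p_coef, q_coef, bound=10.0):
    """Rational activation P(x)/Q(x). p_coef are numerator coefficients
    (constant term first); q_coef are denominator coefficients for x..x^m.
    Q(x) = 1 + |b1*x + ... + bm*x^m| keeps the denominator away from zero,
    and the output clamp is an illustrative constraint on output scaling."""
    P = np.polyval(p_coef[::-1], x)  # polyval wants highest degree first
    Q = 1.0 + np.abs(np.polyval(np.concatenate(([0.0], q_coef))[::-1], x))
    return np.clip(P / Q, -bound, bound)
```

In a trained network, `p_coef` and `q_coef` would be optimized alongside the weights; the constraint bounds how far the activation can scale outputs as those coefficients drift.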
A Simple Baseline for Stable and Plastic Neural Networks
Künzel, Étienne, Jaziri, Achref, Ramesh, Visvanathan
Continual learning in computer vision requires that models adapt to a continuous stream of tasks without forgetting prior knowledge, yet existing approaches often tip the balance heavily toward either plasticity or stability. We introduce RDBP, a simple, low-overhead baseline that unites two complementary mechanisms: ReLUDown, a lightweight activation modification that preserves feature sensitivity while preventing neuron dormancy, and Decreasing Backpropagation, a biologically inspired gradient-scheduling scheme that progressively shields early layers from catastrophic updates. Evaluated on the Continual ImageNet benchmark, RDBP matches or exceeds the plasticity and stability of state-of-the-art methods while reducing computational cost. RDBP thus provides both a practical solution for real-world continual learning and a clear benchmark against which future continual learning strategies can be measured.

Continual learning in computer vision tackles the fundamental challenge of enabling models to adapt to a continuous stream of visual information rather than to a single static dataset. Such systems must continuously integrate new concepts while retaining the features and representations learned from previous tasks.
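The gradient-scheduling idea of progressively shielding early layers can be sketched as per-layer gradient multipliers that are smallest for the earliest layers and shrink further as training proceeds. The geometric schedule below is entirely our assumption, not the paper's rule.

```python
import numpy as np

def decreasing_backprop_scales(n_layers, step, decay=0.99):
    """Per-layer gradient multipliers: layer 0 (earliest) is shielded most,
    and all shielding deepens with the training step. The geometric form
    is an illustrative assumption, not RDBP's exact schedule."""
    depth = np.arange(n_layers)  # 0 = earliest layer
    return decay ** (step * (n_layers - depth))
```

During training, each layer's gradient would be multiplied by its scale before the optimizer step, so late layers stay plastic while early-layer features are increasingly protected.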
CLA: Latent Alignment for Online Continual Self-Supervised Learning
Cignoni, Giacomo, Cossu, Andrea, Gomez-Villa, Alexandra, van de Weijer, Joost, Carta, Antonio
Self-supervised learning (SSL) is able to build latent representations that generalize well to unseen data. However, only a few SSL techniques exist for the online CL setting, where data arrives in small minibatches, the model must comply with a fixed computational budget, and task boundaries are absent. We introduce Continual Latent Alignment (CLA), a novel SSL strategy for online CL that aligns the representations learned by the current model with past representations to mitigate forgetting. We find that CLA speeds up the convergence of the training process in the online scenario, outperforming state-of-the-art approaches under the same computational budget. Surprisingly, we also discover that using CLA as a protocol in the early stages of pretraining leads to better final performance than full i.i.d. pretraining.
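The alignment term can be sketched as a penalty on the drift between current and past representations of the same inputs. The cosine form below is our assumption (CLA's actual alignment loss may differ):

```python
import numpy as np

def latent_alignment_loss(z_current, z_past):
    """Penalize drift of the current model's representations from a past
    snapshot's representations of the same batch: 1 minus the mean cosine
    similarity across rows. The cosine form is an illustrative choice."""
    zc = z_current / np.linalg.norm(z_current, axis=1, keepdims=True)
    zp = z_past / np.linalg.norm(z_past, axis=1, keepdims=True)
    return 1.0 - np.mean(np.sum(zc * zp, axis=1))
```

Added to the SSL objective with a weighting coefficient, such a term keeps the online learner's latent space anchored to what it previously learned without storing raw data.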
Combining Pre-Trained Models for Enhanced Feature Representation in Reinforcement Learning
Piccoli, Elia, Li, Malio, Carfì, Giacomo, Lomonaco, Vincenzo, Bacciu, Davide
The recent focus on, and release of, pre-trained models has been a key component of several advancements in many fields (e.g. Natural Language Processing and Computer Vision); pre-trained models learn disparate latent embeddings that share insightful representations. Reinforcement Learning (RL), on the other hand, focuses on maximizing the cumulative reward obtained via the agent's interaction with the environment. RL agents do not have any prior knowledge about the world: they either learn from scratch an end-to-end mapping between the observation and action spaces or, in more recent works, are paired with monolithic and computationally expensive Foundation Models. How to effectively combine and leverage the hidden information of different pre-trained models simultaneously in RL is still an open and understudied question. In this work, we propose Weight Sharing Attention (WSA), a new architecture to combine embeddings of multiple pre-trained models to shape an enriched state representation, balancing the trade-off between efficiency and performance. We run an extensive comparison between several combination modes, showing that WSA obtains performance comparable to end-to-end models on multiple Atari games. Furthermore, we study the generalization capabilities of this approach and analyze how scaling the number of models influences agents' performance during and after training.
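One plausible reading of "weight sharing attention" is a single projection shared across all pre-trained encoders, with a learned query scoring each model's contribution; the sketch below follows that reading and is our assumption, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weight_sharing_attention(embeddings, W_shared, query):
    """Combine embeddings from several pre-trained encoders: one shared
    projection W_shared maps every model's embedding into a common space,
    and a learned query vector attends over them. Illustrative only."""
    keys = [W_shared @ e for e in embeddings]  # shared weights across models
    scores = np.array([query @ k for k in keys])
    attn = softmax(scores)
    state = sum(a * k for a, k in zip(attn, keys))  # enriched state repr.
    return state, attn
```

Sharing `W_shared` keeps the parameter count independent of the number of pre-trained models, which is the efficiency side of the trade-off the abstract mentions.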
Gradual Fine-Tuning with Graph Routing for Multi-Source Unsupervised Domain Adaptation
Ma, Yao, Louvan, Samuel, Wang, Zhunxuan
Multi-source unsupervised domain adaptation aims to leverage labeled data from multiple source domains for training a machine learning model to generalize well on a target domain without labels. Source domain selection plays a crucial role in determining the model's performance, as it relies on the similarities amongst source and target domains. Nonetheless, existing work on source domain selection often involves heavyweight computational procedures, especially when dealing with numerous source domains and the need to identify the best ones among them. In this paper, we introduce a framework for gradual fine-tuning (GFT) of machine learning models on multiple source domains. We represent the source domains as an undirected weighted graph. We then give a new generalization error bound for GFT along any path within the graph, which is used to determine the optimal path corresponding to the optimal training order. With this formulation, we introduce three lightweight graph-routing strategies that tend to minimize the error bound. Our best strategy improves accuracy by $2.3\%$ over the state-of-the-art on the Natural Language Inference (NLI) task and achieves competitive performance on the Sentiment Analysis (SA) task, notably a $3.9\%$ improvement on a more diverse subset of the data we use for SA.
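One of the lightest routing strategies consistent with this formulation is a shortest-path search over the domain graph, treating edge weights as the pairwise terms of the error bound; the sketch below (plain Dijkstra) is our simplification, not one of the paper's three strategies.

```python
import heapq

def gft_route(n, edges, source, target):
    """Pick a gradual fine-tuning order of domains as the lightest path in
    a weighted domain graph via Dijkstra. Edge weights stand in for the
    pairwise terms of the generalization bound (a simplification).
    edges: list of (u, v, weight) for an undirected graph on nodes 0..n-1."""
    adj = {i: [] for i in range(n)}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist, prev = {source: 0.0}, {}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, u = [target], target
    while u != source:
        u = prev[u]
        path.append(u)
    return path[::-1]  # training order: start domain first, target last
```

The returned node order is the sequence in which the model would be fine-tuned, moving gradually from the starting source domain toward the target.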